#        Eye
# Hair    Brown Blue Hazel Green
#   Black    36    9     5     2
#   Brown    66   34    29    14
#   Red      16    7     7     7
#   Blond     4   64     5     8
User-friendly biplots in R
Centre for Multi-Dimensional Data Visualisation (MuViSU)
muvisu@sun.ac.za
SASA 2024
Correspondence analysis (CA) aims to expose the association between two categorical variables.
Categorical variables measure characteristics of individuals (samples) in the form of finite discrete response levels (category levels).
Summarised in a two-way contingency table.
Focus placed on nominal categorical variables - category levels with no specific rank / order.
Numerous variants of CA are available for application to diverse problems. The interested reader is referred to Gower, Lubbe, and Roux (2011) and Beh and Lombardo (2014).
With biplotEZ, focus is placed on three EZ-to-use variants (more information to follow).
The data matrix in CA() differs from the data matrix used in PCA() and CVA().
\(\mathbf{X}:r\times c\), represents the cross-tabulations of two categorical variables.
The elements of the data matrix represent the frequency of the co-occurrence of two particular levels of the two variables.
Consider the HairEyeColor data set in R, which summarises the hair and eye color of male and female statistics students. For the purpose of this example only the female students will be considered:
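As a minimal sketch, the contingency table used in this example can be extracted in base R. The counts shown in this document (for example, 36 students with black hair and brown eyes) match the Female slice of the built-in HairEyeColor table; the object name X is chosen here to match the notation \(\mathbf{X}\) and is otherwise arbitrary.

```r
# HairEyeColor is a built-in 4 x 4 x 2 table (Hair x Eye x Sex).
# Keep the Female slice as a plain two-way matrix of counts.
X <- unclass(HairEyeColor[, , "Female"])

X        # rows: hair colour, columns: eye colour
dim(X)   # r = 4 hair levels, c = 4 eye levels
sum(X)   # 313 students in total
```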
CA() in biplotEZ
biplot() |> CA() |> interpolate() |> fit.measures() |> samples() |> newsamples() |> legend.type() |> plot()
Take note of the warning message:
It is typical to express the frequencies in terms of proportions / probabilities.
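Consider the correspondence matrix \(\mathbf{P}\), obtained by dividing \(\mathbf{X}\) by its grand total. Its row sums (the row masses) form the diagonal matrix \(\mathbf{D_r}\):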
# [,1] [,2] [,3] [,4]
# [1,] 0.1661342 0.000000 0.0000000 0.0000000
# [2,] 0.0000000 0.456869 0.0000000 0.0000000
# [3,] 0.0000000 0.000000 0.1182109 0.0000000
# [4,] 0.0000000 0.000000 0.0000000 0.2587859
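The diagonal entries printed above equal the row totals divided by the grand total (for example, 52/313 ≈ 0.1661), that is, the row masses. The following base R sketch shows one way these quantities might be computed; the names P, r_mass, c_mass, Dr and Dc are illustrative and not taken from biplotEZ.

```r
X <- unclass(HairEyeColor[, , "Female"])

P <- X / sum(X)           # correspondence matrix: proportions summing to 1
r_mass <- rowSums(P)      # row masses (marginal proportions of hair colour)
c_mass <- colSums(P)      # column masses (marginal proportions of eye colour)

Dr <- diag(unname(r_mass))  # diagonal matrix of row masses
Dc <- diag(unname(c_mass))  # diagonal matrix of column masses

round(Dr, 7)
```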
\[\chi^2 = \sum \frac{(\text{Observed freq.}-\text{Expected freq.})^2}{\text{Expected freq.}}\]
where the sum runs over all cells of the contingency table.
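To make the statistic concrete, the sketch below computes it cell by cell in base R and cross-checks the result against chisq.test(). The expected frequencies assume independence between hair and eye colour; the variable names are illustrative.

```r
X <- unclass(HairEyeColor[, , "Female"])
n <- sum(X)

# Expected frequencies under independence: (row total * column total) / n
E <- outer(rowSums(X), colSums(X)) / n

# Pearson chi-squared statistic, summed over all cells
chi2 <- sum((X - E)^2 / E)
chi2

# Cross-check with the built-in test (small expected counts trigger a warning)
ref <- suppressWarnings(chisq.test(X, correct = FALSE))
all.equal(chi2, unname(ref$statistic))
```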
\[ \mathbf{S} = \mathbf{D_r}^{-\frac{1}{2}}(\mathbf{P}-\mathbf{rc'})\mathbf{D_c}^{-\frac{1}{2}}\]
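\(\mathbf{S}\) is the matrix of standardised Pearson residuals, weighted by the row and column masses through \(\mathbf{D_r}^{-\frac{1}{2}}\) and \(\mathbf{D_c}^{-\frac{1}{2}}\).
The expected frequencies, expressed as proportions, are represented by the product of the row and column masses (\(\mathbf{rc'}\)).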
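The construction of \(\mathbf{S}\) can be sketched in base R as follows. This is an illustration of the formula above, not the internal code of CA(); it also checks the standard identity that the sum of squared entries of \(\mathbf{S}\) (the total inertia) equals \(\chi^2 / n\).

```r
X <- unclass(HairEyeColor[, , "Female"])
n <- sum(X)
P <- X / n
r_mass <- rowSums(P)
c_mass <- colSums(P)

# S = Dr^(-1/2) (P - r c') Dc^(-1/2)
S <- diag(1 / sqrt(r_mass)) %*% (P - r_mass %o% c_mass) %*% diag(1 / sqrt(c_mass))
dimnames(S) <- dimnames(X)

# Total inertia versus the chi-squared statistic divided by n
total_inertia <- sum(S^2)
E <- n * (r_mass %o% c_mass)
chi2 <- sum((X - E)^2 / E)
all.equal(total_inertia, chi2 / n)
```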
Biplot coordinates: singular value decomposition of \(\mathbf{S}\).
\[ \text{svd}(\mathbf{S}) = \mathbf{U\Lambda V'}\]
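A short base R sketch of this decomposition, using svd() on the matrix \(\mathbf{S}\) built from the example data. The squared singular values are the principal inertias; because \(\mathbf{S}\) is centred, its last singular value is numerically zero for this 4 x 4 table.

```r
X <- unclass(HairEyeColor[, , "Female"])
P <- X / sum(X)
r_mass <- rowSums(P)
c_mass <- colSums(P)
S <- diag(1 / sqrt(r_mass)) %*% (P - r_mass %o% c_mass) %*% diag(1 / sqrt(c_mass))

sv <- svd(S)
U <- sv$u             # left singular vectors (rows: hair colour)
V <- sv$v             # right singular vectors (rows: eye colour)
Lambda <- diag(sv$d)  # singular values on the diagonal

# Reconstruction error of S = U Lambda V' (should be near machine precision)
recon_err <- max(abs(S - U %*% Lambda %*% t(V)))
recon_err

# Proportion of total inertia accounted for by each dimension
inertia_prop <- sv$d^2 / sum(sv$d^2)
round(inertia_prop, 4)
```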
The variant refers to how the singular values (\(\mathbf{\Lambda}\)) are allocated between the row and column coordinates in the biplot solution.
Row principal coordinate biplot (default):
\[\begin{aligned} \text{Row coordinates: } \hspace{0.5 cm}&\mathbf{U\Lambda}\\ \text{Column coordinates: }\hspace{0.5 cm}& \mathbf{V}\end{aligned}\]
Row standard coordinate biplot:
\[\begin{aligned} \text{Row coordinates: } \hspace{0.5 cm}&\mathbf{U}\\ \text{Column coordinates: }\hspace{0.5 cm}& \mathbf{V\Lambda}\end{aligned}\]
Symmetric Correspondence Analysis map:
\[\begin{aligned} \text{Row coordinates: } \hspace{0.5 cm}&\mathbf{U\Lambda^{\frac{1}{2}}}\\ \text{Column coordinates: }\hspace{0.5 cm}& \mathbf{V\Lambda^{\frac{1}{2}}}\end{aligned}\]
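The three variants differ only in where the singular values are placed, which the base R sketch below makes explicit. The helper ca_coords() and the labels "Stand" and "Symmetric" are hypothetical names used for illustration (only "Princ" is confirmed as an argument value here); this is not the implementation used by CA().

```r
# Hypothetical helper: row and column coordinates for the three variants
ca_coords <- function(X, variant = c("Princ", "Stand", "Symmetric"), k = 2) {
  variant <- match.arg(variant)
  P <- X / sum(X)
  r_mass <- rowSums(P)
  c_mass <- colSums(P)
  S <- diag(1 / sqrt(r_mass)) %*% (P - r_mass %o% c_mass) %*% diag(1 / sqrt(c_mass))
  sv <- svd(S)
  U <- sv$u[, 1:k, drop = FALSE]
  V <- sv$v[, 1:k, drop = FALSE]
  L <- diag(sv$d[1:k], nrow = k)
  rows <- switch(variant,
                 Princ = U %*% L,            # rows: U Lambda
                 Stand = U,                  # rows: U
                 Symmetric = U %*% sqrt(L))  # rows: U Lambda^(1/2)
  cols <- switch(variant,
                 Princ = V,                  # columns: V
                 Stand = V %*% L,            # columns: V Lambda
                 Symmetric = V %*% sqrt(L))  # columns: V Lambda^(1/2)
  dimnames(rows) <- list(rownames(X), paste0("Dim", 1:k))
  dimnames(cols) <- list(colnames(X), paste0("Dim", 1:k))
  list(rows = rows, cols = cols)
}

X <- unclass(HairEyeColor[, , "Female"])
princ <- ca_coords(X, "Princ")
stand <- ca_coords(X, "Stand")
symm  <- ca_coords(X, "Symmetric")
princ$rows
```

In all three cases the inner product of the row and column coordinates gives the same rank-2 approximation of \(\mathbf{S}\); the variants change only how distances among rows and among columns are displayed.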
CA function
| Argument | Description |
|---|---|
| bp | Object of class biplot. |
| dim.biplot | Dimension of the biplot. Only values 1, 2 and 3 are accepted, with default 2. |
| e.vects | Which eigenvectors (principal components) to extract, with default 1:dim.biplot. |
| variant | Which correspondence analysis variant to use, with default "Princ". |
| lambda.scal | TRUE or FALSE: controls stretching or shrinking of column and row distances, with default FALSE. |
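A minimal usage sketch of these arguments on the example data follows. It assumes that biplotEZ is installed, that R is version 4.1 or later (for the native pipe), and that biplot() accepts a center argument that can be set to FALSE for a table of counts; that last point is an assumption not stated in the table above. The call is guarded so the code still runs when the package is unavailable.

```r
X <- unclass(HairEyeColor[, , "Female"])

if (requireNamespace("biplotEZ", quietly = TRUE)) {
  library(biplotEZ)
  # Assumption: center = FALSE leaves the counts uncentred before CA()
  biplot(X, center = FALSE) |>
    CA(variant = "Princ", dim.biplot = 2) |>
    plot()
}
```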